Dunn Index
   HOME

TheInfoList



OR:

The Dunn index (DI) (introduced by J. C. Dunn in 1974) is a metric for evaluating
clustering algorithm Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...
s. This is part of a group of validity indices including the
Davies–Bouldin index The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. This is an internal evaluation scheme, where the validation of how well the clustering has been d ...
or Silhouette index, in that it is an internal evaluation scheme, where the result is based on the clustered data itself. As do all other such indices, the aim is to identify sets of clusters that are compact, with a small variance between members of the cluster, and well separated, where the means of different clusters are sufficiently far apart, as compared to the within cluster variance. For a given assignment of clusters, a higher Dunn index indicates better clustering. One of the drawbacks of using this is the computational cost as the number of clusters and dimensionality of the data increase.


Preliminaries

There are many ways to define the size or diameter of a cluster. It could be the distance between the farthest two points inside a cluster, it could be the mean of all the pairwise distances between data points inside the cluster, or it could as well be the distance of each data point from the cluster centroid. Each of these formulations are mathematically shown below: Let ''C''''i'' be a cluster of vectors. Let ''x'' and ''y'' be any two n dimensional feature vectors assigned to the same cluster ''C''''i''. : \Delta_i = \underset d(x,y) , which calculates the maximum distance (the version proposed by Dunn). : \Delta_i = \dfrac \underset d(x,y) , which calculates the mean distance between all pairs. : \Delta_i = \dfrac , \mu = \dfrac , calculates distance of all the points from the mean. This can also be said about the intercluster distance, where similar formulations can be made, using either the closest two data points (used by Dunn), one in each cluster, or the farthest two, or the distance between the centroids and so on. The definition of the index includes any such formulation, and the family of indices so formed are called Dunn-like Indices. Let \delta(C_i,C_j) be this intercluster distance metric, between clusters ''C''''i'' and ''C''''j''.


Definition

With the above notation, if there are ''m'' clusters, then the Dunn Index for the set is defined as: : \mathit_m = \frac .


Explanation

Being defined in this way, the ''DI'' depends on ''m'', the number of clusters in the set. If the number of clusters is not known apriori, the ''m'' for which the ''DI'' is the highest can be chosen as the number of clusters. There is also some flexibility when it comes to the definition of ''d(x,y)'' where any of the well known metrics can be used, like
Manhattan distance A taxicab geometry or a Manhattan geometry is a geometry whose usual distance function or Metric (mathematics), metric of Euclidean geometry is replaced by a new metric in which the distance between two points is the sum of the absolute differences ...
or
Euclidean distance In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, therefor ...
based on the geometry of the clustering problem. This formulation has a peculiar problem, in that if one of the clusters is badly behaved, where the others are tightly packed, since the denominator contains a 'max' term instead of an average term, the Dunn Index for that set of clusters will be uncharacteristically low. This is thus a worst case indicator, and has to be kept in mind. There are ready implementations of the Dunn index in some vector based programming languages like
MATLAB MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation ...
, R and
Apache Mahout Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily on linear algebra. In the past, many of the implementations use the ...
.


Notes and references


External links

* * {{Machine learning evaluation metrics Clustering criteria